Abstract

The following detection problem is studied: there are $M$ sequences of samples, out of which
one outlier sequence needs to be detected. Each typical sequence contains $n$ independent and identically
distributed (i.i.d.) continuous observations from a known distribution $\pi$, and the outlier sequence contains
$n$ i.i.d. observations from an outlier distribution $\mu$, which is distinct from $\pi$ but otherwise unknown. A
universal test based on KL divergence is built to approximate the maximum likelihood test, with known
$\pi$ and unknown $\mu$. A KL divergence estimator based on data-dependent partitions is employed. This
estimator is further shown to converge to its true value exponentially fast when the density
ratio satisfies $0 < K_1 \le \frac{d\mu}{d\pi} \le K_2$, where $K_1$ and $K_2$ are positive constants, which in turn implies that
the test is exponentially consistent. The performance of the test is compared with that of a recently
introduced test for this problem based on the machine learning approach of maximum mean discrepancy
(MMD). We identify regimes in which the KL divergence based test is better than the MMD based test.


1 Introduction 


In this paper, we study a problem in which there are $M$ sequences of samples, out of which one outlier sequence
needs to be detected. Each typical sequence consists of $n$ independent and identically distributed (i.i.d.) continuous
observations drawn from a known distribution $\pi$, whereas the outlier sequence consists of $n$ i.i.d. samples
drawn from a distribution $\mu$, which is distinct from $\pi$ but otherwise unknown. The goal is to design a test
to detect the outlier sequence.

The study of such a model is useful in many applications [1]. For example, in cognitive wireless
networks, signals follow different distributions depending on whether the channel is busy or vacant. The
goal in such a network is to identify vacant channels among busy channels based on their corresponding signals,
in order to utilize the vacant channels for improving spectral efficiency. Such a problem was studied in [2]
and [3] under the assumption that both $\mu$ and $\pi$ are known. Other applications include anomaly detection
in large data sets, event detection and environment monitoring in sensor networks, understanding
of visual search in humans and animals, and optimal search and target tracking.

The outlying sequence detection problem with discrete $\mu$ and $\pi$ was studied in [9]. A universal test based
on the generalized likelihood ratio test was proposed, and was shown to be exponentially consistent. The error
exponent was further shown to be optimal as the number of sequences goes to infinity. The test utilizes
empirical distributions to estimate $\mu$ and $\pi$, and is therefore applicable only to the case where $\mu$ and $\pi$ are
discrete.

In this paper, we study the case where the distributions $\mu$ and $\pi$ are continuous and $\mu$ is unknown. We
construct a Kullback-Leibler (KL) divergence based test, and further show that this test is exponentially
consistent.

Our exploration of the problem starts with the case in which both $\mu$ and $\pi$ are known, and the maximum
likelihood test is optimal. An interesting observation is that the test statistic of the optimal test converges
to $D(\mu\|\pi)$ as the sample size goes to infinity if the sequence is the outlier. This motivates the use of a
KL divergence estimator to approximate the test statistic for the case when $\mu$ is unknown. We apply a
divergence estimator based on the idea of data-dependent partitions [10], which was shown to be consistent.
Our first contribution here is to show that such an estimator converges exponentially fast to its true value
when the density ratio satisfies the boundedness condition $0 < K_1 \le \frac{d\mu}{d\pi} \le K_2$, where $K_1$ and $K_2$ are
positive constants. We further design a KL divergence based test using such an estimator and show that the
test is exponentially consistent.

The rest of the paper is organized as follows. In Section 2, we describe the problem formulation. In Section
3, we present the KL divergence based test and establish its exponential consistency. In Section 4, we review
the maximum mean discrepancy (MMD) based test. In Section 5, we provide a numerical comparison of our
KL divergence based test and the MMD based test. All detailed proofs are given in the appendix.


2 Problem Model 

Throughout the paper, random variables are denoted by capital letters, and their realizations are denoted 
by the corresponding lower-case letters. All logarithms are with respect to the natural base. 

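We study an outlier detection problem in which there are in total $M$ data sequences, denoted by $Y^{(i)}$ for
$1 \le i \le M$. Each data sequence $Y^{(i)}$ consists of $n$ i.i.d. samples $Y_1^{(i)}, \ldots, Y_n^{(i)}$ drawn from either a typical
distribution $\pi$ or an outlier distribution $\mu$, where $\pi$ and $\mu$ are continuous, i.e., defined on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$, and $\mu \neq \pi$.
We use the notation $y^{(i)} = (y_1^{(i)}, \ldots, y_n^{(i)})$, where $y_k^{(i)} \in \mathbb{R}$ denotes the $k$-th observation of the $i$-th sequence,
and $y^{Mn}$ for the collection of all $Mn$ observations.
We assume that there is exactly one outlier among the $M$ sequences. If the $i$-th sequence is the outlier, the joint
distribution of all the observations is given by

\[ p_i(y^{Mn}) = \prod_{k=1}^{n} \Big\{ \mu\big(y_k^{(i)}\big) \prod_{j \neq i} \pi\big(y_k^{(j)}\big) \Big\}. \]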

We are interested in the scenario in which the outlier distribution $\mu$ is unknown a priori, but the
typical distribution $\pi$ is known exactly. This is reasonable because in practical scenarios, systems typically start
without outliers and it is not difficult to collect sufficient information about $\pi$.

Our goal is to build a distribution-free test to detect the outlier sequence generated by $\mu$. The test
can be captured by a universal rule $\delta$ that maps the known distribution $\pi$ and the observations
$y^{Mn} \in \mathbb{R}^{Mn}$ to an index in $\{1, \ldots, M\}$, and that must not depend on $\mu$.

The maximum error probability, which is a function of the detector and $(\mu, \pi)$, is defined as

\[ e(\delta, \pi, \mu) = \max_{i=1,\ldots,M} \int_{\{y^{Mn} :\, \delta(y^{Mn}) \neq i\}} p_i(y^{Mn}) \, dy^{Mn}, \]

and the corresponding error exponent is defined as

\[ \alpha(\delta, \pi, \mu) = \lim_{n \to \infty} -\frac{1}{n} \log e(\delta, \pi, \mu). \]

A test is said to be universally consistent if

\[ \lim_{n \to \infty} e(\delta, \pi, \mu) = 0 \]

for any $\mu \neq \pi$. It is said to be universally exponentially consistent if

\[ \alpha(\delta, \pi, \mu) > 0 \]

for any $\mu \neq \pi$.


3 KL divergence based test 

We first introduce the optimal test when both $\mu$ and $\pi$ are known, which is the maximum likelihood test.
We then construct a KL divergence estimator, and prove its exponential consistency. Next, we employ the
KL divergence estimator to approximate the test statistic of the optimal test for the outlying sequence
detection problem, and construct the KL divergence based test.




3.1 Optimal test with $\pi$ and $\mu$ known


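If both $\mu$ and $\pi$ are known, the optimal test for the outlying sequence detection problem is the maximum
likelihood test:

\[ \delta_{\mathrm{ML}}(y^{Mn}, \pi, \mu) = \arg\max_{1 \le i \le M} p_i(y^{Mn}). \tag{1} \]

By normalizing $p_i(y^{Mn})$ with $\pi(y^{Mn}) = \prod_{j=1}^{M} \prod_{k=1}^{n} \pi\big(y_k^{(j)}\big)$, (1) is equivalent to:

\[ \delta_{\mathrm{ML}}(y^{Mn}, \pi, \mu) = \arg\max_{1 \le i \le M} \frac{p_i(y^{Mn})}{\pi(y^{Mn})} = \arg\max_{1 \le i \le M} \frac{1}{n} \sum_{k=1}^{n} \log \frac{\mu\big(y_k^{(i)}\big)}{\pi\big(y_k^{(i)}\big)} = \arg\max_{1 \le i \le M} L_i, \]

where

\[ L_i = \frac{1}{n} \sum_{k=1}^{n} \log \frac{\mu\big(y_k^{(i)}\big)}{\pi\big(y_k^{(i)}\big)}. \tag{2} \]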


The following theorem characterizes the error exponent of the test $\delta_{\mathrm{ML}}$.

Theorem 1. Consider the outlying sequence detection problem with both $\mu$ and $\pi$ known.
The error exponent for the maximum likelihood test (1) is given by

\[ \alpha(\delta_{\mathrm{ML}}, \pi, \mu) = 2B(\pi, \mu), \]

where $B(\pi, \mu)$ is the Bhattacharyya distance between $\mu$ and $\pi$, which is defined as

\[ B(\pi, \mu) = -\log \Big( \int \mu(y)^{1/2} \pi(y)^{1/2} \, dy \Big). \]

Proof. See Appendix A. □

Consider $L_i$ defined in (2). If $y^{(i)}$ is generated from $\mu$, then $L_i \to D(\mu\|\pi)$ almost surely as $n \to \infty$, by the Law
of Large Numbers. Here,

\[ D(\mu\|\pi) = \int d\mu \, \log \frac{d\mu}{d\pi} \]

is the KL divergence between $\mu$ and $\pi$. Similarly, if $y^{(j)}$ is generated from $\pi$, then $L_j \to -D(\pi\|\mu)$ almost surely
as $n \to \infty$. Thus, if $y^{(i)}$ is generated from $\mu$, $L_i$ is an empirical estimate of the KL divergence between $\mu$ and
$\pi$. This motivates us to construct a test based on an estimator of the KL divergence between $\mu$ and $\pi$ when $\mu$ is
unknown.
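
The law-of-large-numbers argument above can be checked numerically. The following minimal Python sketch assumes, purely for illustration, zero-mean Gaussians $\mu = \mathcal{N}(0, 2)$ and $\pi = \mathcal{N}(0, 1)$; these parameters, the sample size, and all function names are assumptions and are not taken from the paper. It evaluates $L_i$ from (2) on one outlier sequence and one typical sequence and compares the results with the closed-form values of $D(\mu\|\pi)$ and $-D(\pi\|\mu)$.

```python
import numpy as np


def gaussian_logpdf(var):
    """Return the log-density of N(0, var); the Gaussian family is an illustrative choice."""
    return lambda y: -0.5 * np.log(2.0 * np.pi * var) - y**2 / (2.0 * var)


def llr_statistic(y, mu_logpdf, pi_logpdf):
    """L_i = (1/n) * sum_k log(mu(y_k) / pi(y_k)) for a single sequence y."""
    y = np.asarray(y, dtype=float)
    return float(np.mean(mu_logpdf(y) - pi_logpdf(y)))


def kl_zero_mean_gaussians(var_p, var_q):
    """Closed-form D(N(0, var_p) || N(0, var_q)) in nats."""
    r = var_p / var_q
    return 0.5 * (r - 1.0 - np.log(r))


rng = np.random.default_rng(0)
n, var_mu, var_pi = 200_000, 2.0, 1.0                  # assumed values
mu_lp, pi_lp = gaussian_logpdf(var_mu), gaussian_logpdf(var_pi)

y_outlier = rng.normal(0.0, np.sqrt(var_mu), size=n)   # drawn from mu
y_typical = rng.normal(0.0, np.sqrt(var_pi), size=n)   # drawn from pi

L_outlier = llr_statistic(y_outlier, mu_lp, pi_lp)
L_typical = llr_statistic(y_typical, mu_lp, pi_lp)

print(L_outlier, kl_zero_mean_gaussians(var_mu, var_pi))    # both near D(mu || pi)
print(L_typical, -kl_zero_mean_gaussians(var_pi, var_mu))   # both near -D(pi || mu)
```

Under these assumptions the statistic is positive for the outlier sequence and negative for the typical one, which is the separation that the maximum likelihood test exploits.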


3.2 KL divergence estimator 


We introduce a KL divergence estimator for continuous distributions based on data-dependent partitions [10].

Assume that the distribution $p$ is unknown and the distribution $q$ is known, and both $p$ and $q$ are continuous.
A sequence of i.i.d. samples $Y \in \mathbb{R}^n$ is generated from $p$. We wish to estimate the KL divergence between
$p$ and $q$. We denote the order statistics of $Y$ by $\{Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)}\}$, where $Y_{(1)} \le Y_{(2)} \le \cdots \le Y_{(n)}$. We
further partition the real line into empirically equiprobable segments as follows:

\[ \{I_t^n\}_{t=1,\ldots,T_n} = \big\{ (-\infty, Y_{(\ell_n)}], \, (Y_{(\ell_n)}, Y_{(2\ell_n)}], \, \ldots, \, (Y_{(\ell_n (T_n - 1))}, \infty) \big\}, \]

where $\ell_n \in \mathbb{N}$, $\ell_n \le n$, is the number of points in each interval except possibly the last one, and $T_n = \lfloor n / \ell_n \rfloor$ is
the number of intervals. A divergence estimator between the sequence $Y \in \mathbb{R}^n$ and the distribution $q$ was
proposed in [10], which is given by

\[ \hat{D}_n(Y\|q) = \sum_{t=1}^{T_n - 1} \frac{\ell_n}{n} \log \frac{\ell_n / n}{q(I_t^n)} + \frac{c_n}{n} \log \frac{c_n / n}{q(I_{T_n}^n)}, \tag{3} \]

where $c_n = n - \ell_n (T_n - 1)$ is the number of points in the last segment.
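
As a concrete illustration of estimator (3), the following Python sketch implements the data-dependent partition estimate for a known reference distribution $q$ supplied through its CDF. The function names, the choice $q = \mathcal{N}(0, 1)$, the sampling distribution $\mathcal{N}(0, 2)$, and the rule $\ell_n \approx \sqrt{n}$ are assumptions made for the example and are not prescribed by the source.

```python
import math
import numpy as np


def std_normal_cdf(x):
    """CDF of N(0, 1), used here as the known reference distribution q (an assumption)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def kl_partition_estimate(y, q_cdf, ell):
    """Estimator (3): KL estimate from empirically equiprobable intervals.

    y     : 1-D array of n samples from the unknown distribution p
    q_cdf : CDF of the known distribution q
    ell   : number of points per interval (ell_n), except possibly the last one
    """
    y = np.sort(np.asarray(y, dtype=float))
    n = y.size
    if not 1 <= ell <= n:
        raise ValueError("need 1 <= ell <= n")
    T = n // ell                                     # T_n = floor(n / ell_n)
    # Boundaries Y_(ell), Y_(2 ell), ..., Y_(ell (T - 1)) (1-based order statistics).
    cuts = y[ell * np.arange(1, T) - 1]
    q_at_cuts = np.array([q_cdf(c) for c in cuts], dtype=float)
    q_mass = np.diff(np.concatenate(([0.0], q_at_cuts, [1.0])))   # q(I_t), t = 1..T
    counts = np.full(T, ell, dtype=float)
    counts[-1] = n - ell * (T - 1)                   # c_n points in the last interval
    p_hat = counts / n                               # empirical mass of each interval
    return float(np.sum(p_hat * np.log(p_hat / q_mass)))


rng = np.random.default_rng(0)
n = 40_000
ell = int(math.sqrt(n))                              # assumed rule: ell_n ~ sqrt(n)

est_outlier = kl_partition_estimate(rng.normal(0.0, math.sqrt(2.0), n), std_normal_cdf, ell)
est_typical = kl_partition_estimate(rng.normal(0.0, 1.0, n), std_normal_cdf, ell)
true_kl = 0.5 * (2.0 - 1.0 - math.log(2.0))          # D(N(0, 2) || N(0, 1))
print(est_outlier, true_kl, est_typical)
```

With these assumed settings, the estimate for samples drawn from $\mathcal{N}(0, 2)$ should be close to the closed-form divergence, while the estimate for samples drawn from $q$ itself should be close to zero.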

The consistency of this estimator was shown in [10]. Here, we characterize the convergence rate by
introducing the following boundedness condition on the density ratio between $p$ and $q$:

\[ 0 < K_1 \le \frac{dp}{dq} \le K_2, \tag{4} \]

where $K_1$ and $K_2$ are positive constants. In practice, such a boundedness condition is often satisfied, for
example, for truncated Gaussian distributions.

The following theorem characterizes a lower bound on the convergence rate of estimator (3).

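Theorem 2. If the density ratio between $p$ and $q$ satisfies (4), and estimator (3) is applied with $T_n, \ell_n \to \infty$
as $n \to \infty$, then for any $\epsilon > 0$,

\[ \lim_{n \to \infty} -\frac{1}{n} \log P\Big( \big| \hat{D}_n(Y\|q) - D(p\|q) \big| > \epsilon \Big) \ge \frac{1}{32} \Big( \frac{K_1}{K_2} \Big)^2 \epsilon^2. \]

Proof. See Appendix B. □

Remark 1. The convergence rate of estimator (3) in Theorem 2 is equivalent to

\[ \big| \hat{D}_n(Y\|q) - D(p\|q) \big| = O_p(n^{-1/2}), \]

where $O_p$ denotes "bounded in probability".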

3.3 Test and performance 

In this subsection, we utilize the estimator based on data-dependent partitions to construct our test. 

It is clear that if $Y^{(i)}$ is the outlier, then $\hat{D}_n(Y^{(i)}\|\pi)$ is a good estimator of $D(\mu\|\pi)$, which is a positive
constant. On the other hand, if $Y^{(i)}$ is a typical sequence, $\hat{D}_n(Y^{(i)}\|\pi)$ should be close to $D(\pi\|\pi) = 0$. Based
on this understanding and the convergence guarantee in Theorem 2, we use $\hat{D}_n(Y^{(i)}\|\pi)$ in place of $L_i$ in (2),
and construct the following test for the outlying sequence detection problem:

\[ \delta_{\mathrm{KL}}(y^{Mn}) = \arg\max_{1 \le i \le M} \hat{D}_n(Y^{(i)}\|\pi). \tag{5} \]
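
A minimal sketch of test (5) is given below. It scores each of the $M$ sequences with a compact version of the partition-based divergence estimate and returns the index of the largest score. The Gaussian setting ($\pi = \mathcal{N}(0, 1)$ with an outlier drawn from $\mathcal{N}(0, 2)$), the values of $M$, $n$ and $\ell_n$, and the helper names are assumptions chosen only to make the example runnable.

```python
import math
import numpy as np


def std_normal_cdf(x):
    """CDF of the assumed typical distribution pi = N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def kl_partition_estimate(y, q_cdf, ell):
    """Partition-based KL estimate of one sequence against a known q (compact form)."""
    y = np.sort(np.asarray(y, dtype=float))
    n = y.size
    T = n // ell                                   # number of intervals
    cuts = y[ell * np.arange(1, T) - 1]            # interval boundaries (order statistics)
    edges = np.concatenate(([0.0], [q_cdf(c) for c in cuts], [1.0]))
    q_mass = np.diff(edges)                        # mass of q on each interval
    counts = np.full(T, ell, dtype=float)
    counts[-1] = n - ell * (T - 1)                 # the last interval may hold more points
    p_hat = counts / n
    return float(np.sum(p_hat * np.log(p_hat / q_mass)))


def kl_outlier_test(sequences, pi_cdf, ell):
    """Test (5): 0-based index of the sequence with the largest estimated divergence."""
    scores = np.array([kl_partition_estimate(y, pi_cdf, ell) for y in sequences])
    return int(np.argmax(scores)), scores


rng = np.random.default_rng(1)
M, n, ell = 5, 2000, 40                            # assumed problem size and interval length
true_outlier = 3                                   # hypothetical position of the outlier
sequences = rng.normal(0.0, 1.0, size=(M, n))
sequences[true_outlier] *= math.sqrt(2.0)          # outlier sequence ~ N(0, 2)

decision, scores = kl_outlier_test(sequences, std_normal_cdf, ell)
print(decision, np.round(scores, 3))
```

The rule never uses the outlier distribution: only the CDF of $\pi$ enters the scores, which is what makes the test universal in $\mu$.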


The following theorem provides a lower bound on the error exponent of $\delta_{\mathrm{KL}}$, which further implies that
$\delta_{\mathrm{KL}}$ is universally exponentially consistent.

Theorem 3. If the density ratio between $\mu$ and $\pi$ satisfies (4), then $\delta_{\mathrm{KL}}$ defined in (5) is exponentially
consistent, and the error exponent is lower bounded as follows:

\[ \alpha(\delta_{\mathrm{KL}}, \pi, \mu) \ge \frac{1}{32} \Big( \frac{K_1}{K_1 + K_2} D(\mu\|\pi) \Big)^2. \tag{6} \]

Proof. See Appendix C. □


4 MMD-Based Test 


In this section, we introduce the MMD based test, which we studied in previous work. We will compare $\delta_{\mathrm{KL}}$
to the MMD based test.


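Footnote to Remark 1: $X_n = O_p(a_n)$ means that for every $\epsilon > 0$ there exists $M > 0$ such that $P(|X_n / a_n| > M) < \epsilon$ for all $n$.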
4.1 Introduction to MMD


In this subsection, we briefly introduce the idea of mean embedding of distributions into a reproducing kernel
Hilbert space (RKHS) [13] and the metric of MMD. Suppose $\mathcal{P}$ is a set of probability distributions, and suppose
$\mathcal{H}$ is the RKHS with an associated kernel $k(\cdot, \cdot)$. We define a mapping from $\mathcal{P}$ to $\mathcal{H}$ such that each
distribution $p \in \mathcal{P}$ is mapped to an element in $\mathcal{H}$ as follows:

\[ \mu_p(\cdot) = E_p[k(\cdot, x)] = \int k(\cdot, x) \, dp(x). \]

Here, $\mu_p(\cdot)$ is referred to as the mean embedding of the distribution $p$ into the Hilbert space $\mathcal{H}$. Due to the
reproducing property of $\mathcal{H}$, it is clear that $E_p[f] = \langle \mu_p, f \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$.


In order to distinguish between two distributions $p$ and $q$, Gretton et al. introduced the following
quantity of maximum mean discrepancy (MMD) based on the mean embeddings $\mu_p$ and $\mu_q$ of $p$ and $q$ in the
RKHS:

\[ \mathrm{MMD}[p, q] := \| \mu_p - \mu_q \|_{\mathcal{H}}. \]

It can be shown that

\[ \mathrm{MMD}[p, q] = \sup_{\|f\|_{\mathcal{H}} \le 1} \big( E_p[f] - E_q[f] \big). \]

Due to the reproducing property of the kernel, the following is true:

\[ \mathrm{MMD}^2[p, q] = E[k(X, X')] - 2E[k(X, Y)] + E[k(Y, Y')], \]

where $X$ and $X'$ are independent but have the same distribution $p$, and $Y$ and $Y'$ are independent but have the
same distribution $q$. An unbiased estimator of $\mathrm{MMD}^2[p, q]$ based on $q$ and $n$ samples $X = \{x_1, x_2, \ldots, x_n\}$
generated from $p$ is given as follows:

\[ \mathrm{MMD}_u^2[X, q] = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} k(x_i, x_j) + E[k(Y, Y')] - \frac{2}{n} \sum_{i=1}^{n} E[k(x_i, Y)], \]

where $Y$ and $Y'$ are independent but have the same distribution $q$.
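
The estimator above can be evaluated in closed form when the kernel and the known distribution are both Gaussian. The sketch below assumes a Gaussian kernel $k(x, y) = \exp(-(x - y)^2 / (2\sigma^2))$ and $q = \mathcal{N}(0, 1)$, for which $E[k(x, Y)] = \sqrt{\sigma^2 / (\sigma^2 + 1)} \, \exp(-x^2 / (2(\sigma^2 + 1)))$ and $E[k(Y, Y')] = \sqrt{\sigma^2 / (\sigma^2 + 2)}$. The kernel, the bandwidth $\sigma$, and the function name are illustrative assumptions; the source does not specify them here.

```python
import numpy as np


def mmd2_unbiased_vs_std_normal(x, bandwidth=1.0):
    """Unbiased MMD^2 estimate between samples x and the known q = N(0, 1).

    Assumes the Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 * bandwidth^2)).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    s2 = bandwidth**2
    # (1 / (n (n - 1))) * sum over i != j of k(x_i, x_j)
    gram = np.exp(-(x[:, None] - x[None, :])**2 / (2.0 * s2))
    within = (gram.sum() - np.trace(gram)) / (n * (n - 1))
    # E[k(Y, Y')] for independent Y, Y' ~ N(0, 1), using Y - Y' ~ N(0, 2)
    e_yy = np.sqrt(s2 / (s2 + 2.0))
    # E[k(x_i, Y)] for Y ~ N(0, 1), evaluated at every sample x_i
    e_xy = np.sqrt(s2 / (s2 + 1.0)) * np.exp(-x**2 / (2.0 * (s2 + 1.0)))
    return float(within + e_yy - 2.0 * np.mean(e_xy))


rng = np.random.default_rng(0)
n = 2000
est_same = mmd2_unbiased_vs_std_normal(rng.normal(0.0, 1.0, n))            # p = q
est_diff = mmd2_unbiased_vs_std_normal(rng.normal(0.0, np.sqrt(2.0), n))   # p = N(0, 2)

# Population MMD^2 between N(0, 2) and N(0, 1) for bandwidth 1
mmd2_true = 1.0 / np.sqrt(5.0) + 1.0 / np.sqrt(3.0) - 2.0 / np.sqrt(4.0)
print(est_same, est_diff, mmd2_true)
```

Because the estimator is unbiased, the value computed for samples drawn from $q$ itself fluctuates around zero and can be slightly negative.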

4.2 Test and performance 

For each sequence $Y^{(i)}$, we compute $\mathrm{MMD}_u^2[Y^{(i)}, \pi]$ for $1 \le i \le M$. It is clear that if $Y^{(i)}$ is the outlier,
$\mathrm{MMD}_u^2[Y^{(i)}, \pi]$ is a good estimator of $\mathrm{MMD}^2[\mu, \pi]$, which is a positive constant. On the other hand, if $Y^{(i)}$
is a typical sequence, $\mathrm{MMD}_u^2[Y^{(i)}, \pi]$ should be a good estimator of $\mathrm{MMD}^2[\pi, \pi]$, which is zero. Based on
the above understanding, we construct the following test:

\[ \delta_{\mathrm{MMD}} = \arg\max_{1 \le i \le M} \mathrm{MMD}_u^2[Y^{(i)}, \pi]. \tag{7} \]

The following theorem provides a lower bound on the error exponent of $\delta_{\mathrm{MMD}}$, and further demonstrates
that the test $\delta_{\mathrm{MMD}}$ is universally exponentially consistent.

Theorem 4. Consider the universal outlying sequence detection problem. Suppose $\delta_{\mathrm{MMD}}$ defined in (7)
applies a bounded kernel with $0 \le k(x, y) \le K$ for any $(x, y)$. Then, the error exponent is lower bounded as
follows:

\[ \alpha(\delta_{\mathrm{MMD}}, \pi, \mu) \ge \frac{\mathrm{MMD}^4[\mu, \pi]}{9K^2}. \tag{8} \]

Proof. See Appendix D. □
5 Numerical results and Discussion


In this section, we compare the performance of $\delta_{\mathrm{KL}}$ and $\delta_{\mathrm{MMD}}$.

We set the number of sequences $M = 5$. We choose the typical distribution $\pi = \mathcal{N}(0, 1)$, and choose the
outlier distribution $\mu = \mathcal{N}(0, 0.2)$, $\mathcal{N}(0, 1.2)$, $\mathcal{N}(0, 1.8)$, $\mathcal{N}(0, 2.0)$, respectively. In Figs. 1, 2, 3, and
4, we plot the logarithm of the probability of error, $\log P_e$, as a function of the sample size $n$.

It can be seen that, for both tests, the probability of error converges
to zero as the sample size increases. Furthermore, $\log P_e$ decreases linearly with $n$, which demonstrates the
exponential consistency of both $\delta_{\mathrm{KL}}$ and $\delta_{\mathrm{MMD}}$.

By comparing the four figures, it can be seen that as the variance of $\mu$ deviates from the variance of $\pi$,
$\delta_{\mathrm{KL}}$ outperforms $\delta_{\mathrm{MMD}}$. The numerical results and the theoretical lower bounds on the error exponents give us
some intuition for identifying regimes in which one test outperforms the other. As shown above, when the
distributions $\mu$ and $\pi$ become more different from each other, $\delta_{\mathrm{KL}}$ outperforms $\delta_{\mathrm{MMD}}$. The reason is that
for any pair of distributions, MMD is bounded in $[0, 2K]$, while the KL divergence is unbounded.
As the distributions become more different from each other, the KL divergence increases, and the KL
divergence based test has a larger error exponent than the MMD based test.
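
The boundedness argument can be illustrated with closed-form values. The sketch below assumes that the second parameter of $\mathcal{N}(0, \cdot)$ is a variance and that the MMD uses a Gaussian kernel with unit bandwidth, so that $K = 1$; both are assumptions made for illustration, since the kernel used in the experiments is not stated in this section. It tabulates $D(\mu\|\pi)$ and $\mathrm{MMD}^2[\mu, \pi]$ for $\pi = \mathcal{N}(0, 1)$ over the four outlier variances used above plus two larger ones added for contrast.

```python
import math


def kl_zero_mean_gaussians(var_mu, var_pi=1.0):
    """D(N(0, var_mu) || N(0, var_pi)) in nats."""
    r = var_mu / var_pi
    return 0.5 * (r - 1.0 - math.log(r))


def mmd2_zero_mean_gaussians(var_mu, var_pi=1.0, bandwidth=1.0):
    """Population MMD^2 for the Gaussian kernel exp(-(x - y)^2 / (2 * bandwidth^2))."""
    s2 = bandwidth**2

    def expected_kernel(var_diff):
        # E[k(A, B)] when A - B is zero-mean Gaussian with variance var_diff
        return math.sqrt(s2 / (s2 + var_diff))

    return (expected_kernel(2.0 * var_mu)
            - 2.0 * expected_kernel(var_mu + var_pi)
            + expected_kernel(2.0 * var_pi))


variances = [0.2, 1.2, 1.8, 2.0, 10.0, 100.0]   # the last two are added for illustration
table = [(v, kl_zero_mean_gaussians(v), mmd2_zero_mean_gaussians(v)) for v in variances]
for v, kl, mmd2 in table:
    print(f"var={v:6.1f}  KL={kl:8.4f}  MMD^2={mmd2:6.4f}")
```

Under these assumptions the KL divergence keeps growing with the variance of $\mu$, whereas the squared MMD never exceeds $2K = 2$.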



Figure 1: Comparison of the performance of the KL divergence and MMD based tests with $\pi = \mathcal{N}(0, 1)$
and $\mu = \mathcal{N}(0, 0.2)$



Figure 2: Comparison of the performance of the KL divergence and MMD based tests with $\pi = \mathcal{N}(0, 1)$
and $\mu = \mathcal{N}(0, 1.2)$

Figure 3: Comparison of the performance of the KL divergence and MMD based tests with $\pi = \mathcal{N}(0, 1)$
and $\mu = \mathcal{N}(0, 1.8)$



Figure 4: Comparison of the performance of the KL divergence and MMD based tests with $\pi = \mathcal{N}(0, 1)$
and $\mu = \mathcal{N}(0, 2)$


Appendix 

A Proof of Theorem 1 


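Recall that the maximum likelihood test is defined as

\[ \delta_{\mathrm{ML}}(y^{Mn}) = \arg\max_{1 \le i \le M} \log \frac{p_i(y^{Mn})}{\prod_{j=1}^{M} \prod_{k=1}^{n} \pi\big(y_k^{(j)}\big)} = \arg\max_{1 \le i \le M} \frac{1}{n} \sum_{k=1}^{n} \log \frac{\mu\big(y_k^{(i)}\big)}{\pi\big(y_k^{(i)}\big)} = \arg\max_{1 \le i \le M} L_i. \]

Now we characterize the exponent for the maximum likelihood test. By the symmetry of the problem,
it is clear that $P_i\{\delta_{\mathrm{ML}} \neq i\}$ is the same for every $i = 1, \ldots, M$, hence

\[ \max_{i=1,\ldots,M} P_i\{\delta_{\mathrm{ML}} \neq i\} = P_1\{\delta_{\mathrm{ML}} \neq 1\}. \]

It now follows that

\[ P_1\{L_1 < L_2\} \le P_1\{\delta_{\mathrm{ML}} \neq 1\} = P_1\Big\{ L_1 < \max_{2 \le j \le M} L_j \Big\} \le (M - 1) P_1\{L_1 < L_2\}. \]

Since $\log(M)/n \to 0$, the left-hand side and the right-hand side share the same error exponent, so
we only need to compute the exponent of $P_1\{L_1 < L_2\}$.

Let us use the notation

\[ Z_k = \log \Bigg( \frac{\mu\big(y_k^{(1)}\big) \, \pi\big(y_k^{(2)}\big)}{\pi\big(y_k^{(1)}\big) \, \mu\big(y_k^{(2)}\big)} \Bigg). \]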
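Then, we can rewrite the probability as

\[ P_1\{L_1 < L_2\} = P_1\Bigg\{ \sum_{k=1}^{n} \log \frac{\mu\big(y_k^{(1)}\big)}{\pi\big(y_k^{(1)}\big)} < \sum_{k=1}^{n} \log \frac{\mu\big(y_k^{(2)}\big)}{\pi\big(y_k^{(2)}\big)} \Bigg\} = P_1\Bigg\{ \sum_{k=1}^{n} Z_k < 0 \Bigg\}. \]

Thus we can apply Cramér's theorem directly:

\[ \lim_{n \to \infty} -\frac{1}{n} \log P_1\Bigg\{ \sum_{k=1}^{n} Z_k \le na \Bigg\} = \Lambda_Z(a), \]

for $a < E(Z) = D(\pi\|\mu) + D(\mu\|\pi)$, where $\Lambda_Z(a)$ is the large-deviation rate function.
In our case, $0 < E(Z)$ for $\mu \neq \pi$. So

\[ \lim_{n \to \infty} -\frac{1}{n} \log P_1\Bigg\{ \sum_{k=1}^{n} Z_k \le 0 \Bigg\} = \Lambda_Z(0) = \sup_{\lambda} \big[ -\kappa_Z(\lambda) \big]. \]

We just need to compute the log-MGF of the random variable $Z$,

\[ \kappa_Z(\lambda) = \log E\big(e^{\lambda Z}\big) = \log E\Bigg[ \frac{\mu^{\lambda}\big(Y^{(1)}\big) \, \pi^{\lambda}\big(Y^{(2)}\big)}{\pi^{\lambda}\big(Y^{(1)}\big) \, \mu^{\lambda}\big(Y^{(2)}\big)} \Bigg]. \]

Given that $Y^{(1)}$ is generated from $\mu$ and $Y^{(2)}$ is generated from $\pi$, evaluating the log-MGF at $-\lambda$ gives

\[ \kappa_Z(-\lambda) = \log \Big( \int \mu^{1-\lambda}(y) \, \pi^{\lambda}(y) \, dy \Big) + \log \Big( \int \pi^{1-\lambda}(y) \, \mu^{\lambda}(y) \, dy \Big) = -C_{\lambda}(\pi, \mu) - C_{\lambda}(\mu, \pi), \]

where

\[ C_{\lambda}(p, q) = -\log \Big( \int p^{\lambda}(y) \, q^{1-\lambda}(y) \, dy \Big). \]

Since the supremum over $\lambda$ is unchanged by the substitution $\lambda \to -\lambda$, the error exponent is

\[ \sup_{\lambda} \big[ -\kappa_Z(\lambda) \big] = \max_{\lambda} \big[ C_{\lambda}(\pi, \mu) + C_{\lambda}(\mu, \pi) \big]. \]

Since $C_{\lambda}(p, q)$ is concave in $\lambda$, and $C_{\lambda}(\pi, \mu) = C_{1-\lambda}(\mu, \pi)$, the expression above is maximized at $\lambda^* = \frac{1}{2}$, so

\[ \lim_{n \to \infty} -\frac{1}{n} \log \max_{i=1,\ldots,M} P_i\{\delta_{\mathrm{ML}} \neq i\} = \max_{\lambda} \big[ C_{\lambda}(\pi, \mu) + C_{\lambda}(\mu, \pi) \big] = 2B(\pi, \mu), \]

where $B(\pi, \mu)$ is the Bhattacharyya distance between $\mu$ and $\pi$, which is defined as

\[ B(\pi, \mu) = -\log \Big( \int \mu(y)^{1/2} \pi(y)^{1/2} \, dy \Big). \tag{9} \]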
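
The maximization over $\lambda$ can be verified numerically. The following sketch assumes $\mu = \mathcal{N}(0, 2)$ and $\pi = \mathcal{N}(0, 1)$ (illustrative choices, as are the grid sizes and function names), approximates the integrals defining $C_{\lambda}$ by a Riemann sum on a fine grid, and checks that $C_{\lambda}(\pi, \mu) + C_{\lambda}(\mu, \pi)$ peaks at $\lambda = 1/2$ with value $2B(\pi, \mu)$.

```python
import numpy as np

# Integration grid; wide enough that both Gaussian densities are negligible at the ends.
y = np.linspace(-40.0, 40.0, 80_001)
dy = y[1] - y[0]


def log_gauss(points, var):
    """Log-density of N(0, var) evaluated at the given points."""
    return -0.5 * np.log(2.0 * np.pi * var) - points**2 / (2.0 * var)


def chernoff(lam, log_p, log_q):
    """C_lambda(p, q) = -log( integral of p^lambda * q^(1 - lambda) ), via a Riemann sum."""
    integrand = np.exp(lam * log_p + (1.0 - lam) * log_q)
    return -np.log(np.sum(integrand) * dy)


log_mu = log_gauss(y, 2.0)    # assumed outlier distribution N(0, 2)
log_pi = log_gauss(y, 1.0)    # assumed typical distribution N(0, 1)

lams = np.linspace(0.01, 0.99, 99)
exponent = np.array([chernoff(l, log_pi, log_mu) + chernoff(l, log_mu, log_pi) for l in lams])
lam_star = lams[np.argmax(exponent)]

# Closed-form Bhattacharyya distance between N(0, a) and N(0, b):
# B = 0.5 * log((a + b) / (2 * sqrt(a * b)))
bhattacharyya = 0.5 * np.log((2.0 + 1.0) / (2.0 * np.sqrt(2.0 * 1.0)))
print(lam_star, exponent.max(), 2.0 * bhattacharyya)
```

The sum is symmetric about $\lambda = 1/2$ because of the identity $C_{\lambda}(\pi, \mu) = C_{1-\lambda}(\mu, \pi)$, which is why the maximizer does not depend on the particular pair of distributions.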


B Proof of Theorem 2 

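To show the exponential consistency of our estimator, we invoke a result by Lugosi and Nobel that
specifies sufficient conditions on the partition of the space under which the empirical measure converges to
the true measure.

Let $\mathcal{A}$ be a family of partitions of $\mathbb{R}$. The maximal cell count of $\mathcal{A}$ is given by

\[ c(\mathcal{A}) = \sup_{\pi \in \mathcal{A}} |\pi|, \]

where $|\pi|$ denotes the number of cells in partition $\pi$ (here $\pi$ denotes a partition, not the typical distribution).

The complexity of $\mathcal{A}$ is measured by the growth function as described below. Fix $n$ points in $\mathbb{R}$,

\[ x^n = \{x_1, \ldots, x_n\}. \]

Let $\Delta(\mathcal{A}, x^n)$ be the number of distinct partitions

\[ \{I_1 \cap x^n, \ldots, I_r \cap x^n\} \]

of the finite set $x^n$ that can be induced by partitions $\pi = \{I_1, \ldots, I_r\} \in \mathcal{A}$. Define the growth function of $\mathcal{A}$
as

\[ \Delta_n^*(\mathcal{A}) = \max_{x^n \in \mathbb{R}^n} \Delta(\mathcal{A}, x^n), \]

which is the largest number of distinct partitions of any $n$-point subset of $\mathbb{R}$ that can be induced by the
partitions in $\mathcal{A}$.

Lemma 1. (Lugosi and Nobel) Let $Y_1, Y_2, \ldots$ be i.i.d. random variables in $\mathbb{R}$ with $Y_i \sim \mu$, and let $\mu_n$ denote
the empirical probability measure based on $n$ samples. Let $\mathcal{A}$ be any collection of partitions of $\mathbb{R}$. Then, for each $n \ge 1$
and every $\epsilon > 0$,

\[ P\Big\{ \sup_{\pi \in \mathcal{A}} \sum_{I \in \pi} |\mu_n(I) - \mu(I)| > \epsilon \Big\} \le 4 \Delta_{2n}^*(\mathcal{A}) \, 2^{c(\mathcal{A})} \exp(-n\epsilon^2/32). \tag{10} \]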

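To prove Theorem 2, we consider the case where the typical distribution $q$ is known, and a given sequence
$Y \in \mathbb{R}^n$ is independently generated from an unknown distribution $p$. We further assume that $p$ and $q$ are
both absolutely continuous probability measures defined on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$, and satisfy

\[ 0 < K_1 \le \frac{dp}{dq} \le K_2. \]

Denote the empirical probability measure based on the sequence $Y$ by $p_n$ (since $Y$ is generated from $p$),
and define the empirically equiprobable partition as follows. Let the order statistics of $Y$ be
$\{Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)}\}$, where $Y_{(1)} \le Y_{(2)} \le \cdots \le Y_{(n)}$. The real line is partitioned into empirically equiprobable
segments according to

\[ \{I_t^n\}_{t=1,\ldots,T_n} = \big\{ (-\infty, Y_{(\ell_n)}], \, (Y_{(\ell_n)}, Y_{(2\ell_n)}], \, \ldots, \, (Y_{(\ell_n (T_n - 1))}, \infty) \big\}, \]

where $\ell_n \in \mathbb{N}$, $\ell_n \le n$, is the number of points in each interval except possibly the last one, and $T_n = \lfloor n / \ell_n \rfloor$ is
the number of intervals. Assume that as $n \to \infty$, both $T_n, \ell_n \to \infty$. Our estimator can then be written as

\[ \hat{D}_n(Y\|q) = \sum_{t=1}^{T_n} p_n(I_t^n) \log \frac{p_n(I_t^n)}{q(I_t^n)}. \]

If we denote the true equiprobable partition based on the true distribution $p$ by $\{I_t\}$, then

\[ p(I_t) = \frac{1}{T_n} = p_n(I_t^n). \]

The estimation error can be decomposed as

\[ \big| \hat{D}_n(Y\|q) - D(p\|q) \big| \le \Bigg| \sum_{t=1}^{T_n} p_n(I_t^n) \log \frac{p_n(I_t^n)}{q(I_t^n)} - \sum_{t=1}^{T_n} p(I_t) \log \frac{p(I_t)}{q(I_t)} \Bigg| + \Bigg| \sum_{t=1}^{T_n} p(I_t) \log \frac{p(I_t)}{q(I_t)} - \int dp \, \log \frac{dp}{dq} \Bigg| = e_1 + e_2. \]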
Intuitively, $e_2$ is the approximation error caused by numerical integration, which diminishes as $T_n$ increases;
$e_1$ is the estimation error caused by the difference between the empirically equiprobable partition and the true
equiprobable partition, and by the difference between the empirical probability measure of an interval and its true
probability measure.

In addition, $e_2$ depends only on $T_n$ and the distributions $p$ and $q$, namely, $e_2$ is a deterministic term, while
$e_1$ also depends on the data $Y$, which is random. Next, we focus on bounding the $e_1$ term.

Since $p(I_t) = \frac{1}{T_n} = p_n(I_t^n)$, the estimation error $e_1$ can be written as

\[ e_1 = \Bigg| \sum_{t=1}^{T_n} \frac{1}{T_n} \big( \log q(I_t) - \log q(I_t^n) \big) \Bigg| \le \frac{1}{T_n} \sum_{t=1}^{T_n} \big| \log q(I_t) - \log q(I_t^n) \big| = \frac{1}{T_n} \sum_{t=1}^{T_n} |f'(\xi_t)| \, \big| q(I_t) - q(I_t^n) \big|, \]

where $f(x) = \log x$, $f'(x) = 1/x$, and $\xi_t$ is a real number between $q(I_t)$ and $q(I_t^n)$. We use the mean value
theorem in the last step.

Since $\xi_t \ge \min\{q(I_t), q(I_t^n)\}$, we get

\[ e_1 \le \frac{1}{T_n} \sum_{t=1}^{T_n} \frac{\big| q(I_t) - q(I_t^n) \big|}{\min\{q(I_t), q(I_t^n)\}} \le \frac{1}{a} \sum_{t=1}^{T_n} \big| q(I_t) - q(I_t^n) \big|, \tag{11} \]

where

\[ a = \frac{T_n}{\max_{1 \le t \le T_n} \big\{ \frac{1}{q(I_t)}, \frac{1}{q(I_t^n)} \big\}}. \]

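To get an exponential bound for $e_1$, we apply Lemma 1 to our problem. In our case, $\{I_t^n\}$ are the
equiprobable segments based on the empirical measure $p_n$. Suppose $\mathcal{A}_n$ is the collection of all the partitions of
$\mathbb{R}$ into empirically equiprobable intervals based on $n$ sample points. Then, from (10),

\[ P\Bigg\{ \sum_{t=1}^{T_n} \big| p_n(I_t^n) - p(I_t^n) \big| > \epsilon \Bigg\} \le P\Bigg\{ \sup_{\pi \in \mathcal{A}_n} \sum_{I \in \pi} |p_n(I) - p(I)| > \epsilon \Bigg\} \le 4 \Delta_{2n}^*(\mathcal{A}_n) \, 2^{c(\mathcal{A}_n)} \exp(-n\epsilon^2/32). \tag{12} \]

For this to be a meaningful exponential bound, we still need to verify two conditions in our case: as
$n \to \infty$,

a) $n^{-1} c(\mathcal{A}_n) \to 0$, b) $n^{-1} \log \Delta_{2n}^*(\mathcal{A}_n) \to 0$.

Here,

\[ c(\mathcal{A}_n) = \sup_{\pi \in \mathcal{A}_n} |\pi| = T_n. \]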
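Since $\ell_n = n / T_n \to \infty$ as $n \to \infty$, we have that

\[ \frac{1}{n} c(\mathcal{A}_n) = \frac{T_n}{n} = \frac{1}{\ell_n} \to 0. \]

Next consider the growth function $\Delta_{2n}^*(\mathcal{A}_n)$, which is defined as the largest number of distinct partitions
of any $2n$-point subset of $\mathbb{R}$ that can be induced by the partitions in $\mathcal{A}_n$. Namely,

\[ \Delta_{2n}^*(\mathcal{A}_n) = \max_{x^{2n} \in \mathbb{R}^{2n}} \Delta(\mathcal{A}_n, x^{2n}). \]

In our algorithm, the partitioning number $\Delta_{2n}^*(\mathcal{A}_n)$ is the number of ways that $2n$ fixed points can be
partitioned by $T_n$ intervals. Then

\[ \Delta_{2n}^*(\mathcal{A}_n) = \binom{2n + T_n}{T_n}. \]

Let $h$ be the binary entropy function, defined as

\[ h(x) = -x \log(x) - (1 - x) \log(1 - x), \quad \text{for } x \in (0, 1). \]

By the inequality $\log \binom{s}{t} \le s \, h(t/s)$, we obtain

\[ \log \Delta_{2n}^*(\mathcal{A}_n) \le (2n + T_n) \, h\Big( \frac{T_n}{2n + T_n} \Big) \le 3n \, h\Big( \frac{1}{\ell_n} \Big). \]

As $\ell_n \to \infty$, the last inequality implies that

\[ \frac{1}{n} \log \Delta_{2n}^*(\mathcal{A}_n) \to 0. \]

Now we can conclude that inequality (12) is indeed an exponential bound: the coefficients $\Delta_{2n}^*(\mathcal{A}_n)$
and $2^{c(\mathcal{A}_n)}$ do not influence the exponent.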
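
Condition b) can be checked numerically. The sketch below assumes that the growth function equals the binomial coefficient $\binom{2n + T_n}{T_n}$ and uses the rule $\ell_n = \lfloor \sqrt{n} \rfloor$; both are illustrative assumptions, as are the function names. It verifies the entropy bound $\log \binom{s}{t} \le s \, h(t/s)$ and shows that $n^{-1} \log \Delta_{2n}^*(\mathcal{A}_n)$ decays as $n$ grows.

```python
import math


def log_binom(s, t):
    """Natural log of the binomial coefficient C(s, t), computed with log-gamma."""
    return math.lgamma(s + 1) - math.lgamma(t + 1) - math.lgamma(s - t + 1)


def binary_entropy(x):
    """h(x) = -x log x - (1 - x) log(1 - x), in nats, for 0 < x < 1."""
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


rows = []
for n in (100, 10_000, 1_000_000):
    ell = math.isqrt(n)               # assumed rule: ell_n = floor(sqrt(n))
    T = n // ell                      # T_n = floor(n / ell_n)
    s = 2 * n + T
    log_growth = log_binom(s, T)      # log of C(2n + T_n, T_n)
    bound = s * binary_entropy(T / s) # entropy upper bound on the same quantity
    rows.append((n, log_growth / n, bound / n))
    print(n, log_growth / n, bound / n)
```

Both normalized quantities shrink toward zero, so the polynomially growing prefactor cannot offset the exponential term in the bound.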

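Since $|p_n(I_t^n) - p(I_t^n)| = |p(I_t) - p(I_t^n)|$ and $K_1 \le \frac{dp}{dq} \le K_2$, the following holds:

\[ P\Bigg\{ \sum_{t=1}^{T_n} \big| q(I_t^n) - q(I_t) \big| > \epsilon \Bigg\} \le P\Bigg\{ \sum_{t=1}^{T_n} \frac{1}{K_1} \big| p(I_t^n) - p(I_t) \big| > \epsilon \Bigg\} = P\Bigg\{ \sum_{t=1}^{T_n} \big| p_n(I_t^n) - p(I_t^n) \big| > K_1 \epsilon \Bigg\} \le 4 \Delta_{2n}^*(\mathcal{A}_n) \, 2^{c(\mathcal{A}_n)} \exp(-n K_1^2 \epsilon^2 / 32). \tag{13} \]

Combining this with (11), we can control the estimation error $e_1 + e_2$ with the following bound:

\[ P\{e_1 + e_2 > \epsilon\} \le P\Bigg\{ \sum_{t=1}^{T_n} \big| q(I_t) - q(I_t^n) \big| > a(\epsilon - e_2) \Bigg\} \le 4 \Delta_{2n}^*(\mathcal{A}_n) \, 2^{c(\mathcal{A}_n)} \exp\big( -n a^2 K_1^2 (\epsilon - e_2)^2 / 32 \big). \]

Recall that

\[ a = \frac{T_n}{\max_{1 \le t \le T_n} \big\{ \frac{1}{q(I_t)}, \frac{1}{q(I_t^n)} \big\}}. \]

Since $q(I_t^n)$ converges to $q(I_t)$ exponentially fast by (13), we have

\[ \lim_{n \to \infty} a = \lim_{n \to \infty} \frac{T_n}{\max_{1 \le t \le T_n} \big\{ \frac{1}{q(I_t)}, \frac{1}{q(I_t^n)} \big\}} = \lim_{n \to \infty} \frac{T_n}{\max_{1 \le t \le T_n} \big\{ \frac{1}{q(I_t)} \big\}} = \lim_{n \to \infty} \frac{\min_{1 \le t \le T_n} \{ q(I_t) \}}{1 / T_n} \ge \frac{1}{K_2}, \]

where the last inequality uses $p(I_t) = 1/T_n$ and $p(I_t) \le K_2 \, q(I_t)$.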
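Finally, we can compute the error exponent:

\[ \lim_{n \to \infty} -\frac{1}{n} \log P\Big\{ \big| \hat{D}_n(Y\|q) - D(p\|q) \big| > \epsilon \Big\} \ge \lim_{n \to \infty} -\frac{1}{n} \log P\{e_1 + e_2 > \epsilon\} \]

\[ \ge \lim_{n \to \infty} -\frac{1}{n} \log \Big( 4 \Delta_{2n}^*(\mathcal{A}_n) \, 2^{c(\mathcal{A}_n)} \exp\big( -n a^2 K_1^2 (\epsilon - e_2)^2 / 32 \big) \Big) \]

\[ = \lim_{n \to \infty} \Big( a^2 K_1^2 (\epsilon - e_2)^2 / 32 - \frac{1}{n} \log \Delta_{2n}^*(\mathcal{A}_n) - \frac{c(\mathcal{A}_n)}{n} \log 2 - \frac{1}{n} \log 4 \Big) \]

\[ \ge \lim_{n \to \infty} \frac{K_1^2}{K_2^2} \frac{(\epsilon - e_2)^2}{32}. \]

Since $e_2$ is the approximation error caused by numerical integration, $\lim_{n \to \infty} e_2 = 0$. We have thus proved that

\[ \lim_{n \to \infty} -\frac{1}{n} \log P\Big\{ \big| \hat{D}_n(Y\|q) - D(p\|q) \big| > \epsilon \Big\} \ge \frac{1}{32} \Big( \frac{K_1}{K_2} \Big)^2 \epsilon^2. \]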
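C Proof of Theorem 3


Recall that our test is defined as

\[ \delta_{\mathrm{KL}}(y^{Mn}) = \arg\max_{1 \le j \le M} \hat{D}_n(Y^{(j)}\|\pi). \]

Now we show that the proposed test is exponentially consistent. By the symmetry of the problem, it is
clear that $P_i\{\delta_{\mathrm{KL}} \neq i\}$ is the same for every $i = 1, \ldots, M$, hence

\[ \max_{i=1,\ldots,M} P_i\{\delta_{\mathrm{KL}} \neq i\} = P_1\{\delta_{\mathrm{KL}} \neq 1\}. \]

It now follows that

\[ P_1\{\delta_{\mathrm{KL}} \neq 1\} = P_1\Big\{ \hat{D}_n(Y^{(1)}\|\pi) < \max_{2 \le j \le M} \hat{D}_n(Y^{(j)}\|\pi) \Big\} \]

\[ \le (M - 1) P_1\Big\{ \hat{D}_n(Y^{(1)}\|\pi) < \hat{D}_n(Y^{(2)}\|\pi) \Big\} \]

\[ = (M - 1) P_1\Big\{ \hat{D}_n(Y^{(1)}\|\pi) - D(\mu\|\pi) - \hat{D}_n(Y^{(2)}\|\pi) < -D(\mu\|\pi) \Big\} \]

\[ \le (M - 1) P_1\Big\{ \big| \hat{D}_n(Y^{(1)}\|\pi) - D(\mu\|\pi) \big| + \big| \hat{D}_n(Y^{(2)}\|\pi) \big| > D(\mu\|\pi) \Big\} \]

\[ \le (M - 1) \Big( P_1\Big\{ \big| \hat{D}_n(Y^{(1)}\|\pi) - D(\mu\|\pi) \big| > c \, D(\mu\|\pi) \Big\} + P_1\Big\{ \big| \hat{D}_n(Y^{(2)}\|\pi) \big| > (1 - c) D(\mu\|\pi) \Big\} \Big), \]

where $c \in (0, 1)$, so that we can optimize over $c$ to get a tighter bound on the error exponent.

Now apply the result proved in Theorem 2. For the outlier sequence, the density ratio $\frac{d\mu}{d\pi}$ satisfies (4) with
constants $K_1$ and $K_2$; for the typical sequence $Y^{(2)}$, the density ratio is identically one, so (4) holds with
$K_1 = K_2 = 1$ and $D(\pi\|\pi) = 0$. We get

\[ \lim_{n \to \infty} -\frac{1}{n} \log P_1\Big\{ \big| \hat{D}_n(Y^{(1)}\|\pi) - D(\mu\|\pi) \big| > c \, D(\mu\|\pi) \Big\} \ge \frac{1}{32} \Big( \frac{K_1 \, c \, D(\mu\|\pi)}{K_2} \Big)^2, \]

\[ \lim_{n \to \infty} -\frac{1}{n} \log P_1\Big\{ \hat{D}_n(Y^{(2)}\|\pi) > (1 - c) D(\mu\|\pi) \Big\} \ge \frac{1}{32} \big( (1 - c) D(\mu\|\pi) \big)^2. \]

The best bound is achieved when the two exponents are equal, which gives

\[ c^* = \frac{K_2}{K_1 + K_2}, \]

and the resulting error exponent is

\[ \alpha(\delta_{\mathrm{KL}}, \pi, \mu) \ge \frac{1}{32} \Big( \frac{K_1 \, D(\mu\|\pi)}{K_1 + K_2} \Big)^2. \]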

D Proof of Theorem 4 


We first introduce McDiarmid's inequality, which is useful in bounding the probability of error in our
proof.

Lemma 2 (McDiarmid's Inequality). Let $f : \mathcal{X}^m \to \mathbb{R}$ be a function such that for all $i \in \{1, \ldots, m\}$, there
exist $c_i < \infty$ for which

\[ \sup_{X \in \mathcal{X}^m, \, \tilde{x} \in \mathcal{X}} \big| f(x_1, \ldots, x_m) - f(x_1, \ldots, x_{i-1}, \tilde{x}, x_{i+1}, \ldots, x_m) \big| \le c_i. \tag{14} \]

Then for all probability measures $p$ and every $\epsilon > 0$,

\[ P_X\big( f(X) - E_X[f(X)] > \epsilon \big) < \exp\Big( -\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2} \Big), \]

where $X$ denotes $(x_1, \ldots, x_m)$, $E_X$ denotes the expectation over the $m$ random variables $x_i \sim p$, and $P_X$
denotes the probability over these $m$ variables.

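In order to analyze the probability of error for the test $\delta_{\mathrm{MMD}}$, without loss of generality, we assume that
the first sequence is the anomalous sequence generated by the anomalous distribution $\mu$. Hence,

\[ \max_{i=1,\ldots,M} P_i\{\delta_{\mathrm{MMD}} \neq i\} = P_1(\delta_{\mathrm{MMD}} \neq 1) = P_1\big( \exists k \neq 1 : \mathrm{MMD}_u^2[Y^{(k)}, \pi] > \mathrm{MMD}_u^2[Y^{(1)}, \pi] \big) \le \sum_{k=2}^{M} P_1\big( \mathrm{MMD}_u^2[Y^{(k)}, \pi] > \mathrm{MMD}_u^2[Y^{(1)}, \pi] \big). \]

For $k = 1, \ldots, M$, we have

\[ \mathrm{MMD}_u^2[Y^{(k)}, \pi] = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} k\big(y_i^{(k)}, y_j^{(k)}\big) - \frac{2}{n} \sum_{i=1}^{n} E_x\big[k\big(y_i^{(k)}, x\big)\big] + E_{x,x'}[k(x, x')], \]

where $x$ and $x'$ are i.i.d. generated from $\pi$. We define the function

\[ \Delta_k\big(Y^{(1)}, Y^{(k)}\big) = \mathrm{MMD}_u^2[Y^{(k)}, \pi] - \mathrm{MMD}_u^2[Y^{(1)}, \pi]. \]

It can be shown that

\[ E\{\mathrm{MMD}_u^2[Y^{(1)}, \pi]\} = \mathrm{MMD}^2[\mu, \pi], \]

and for $k \neq 1$,

\[ E\{\mathrm{MMD}_u^2[Y^{(k)}, \pi]\} = 0. \]

For $1 \le i \le n$ and $1 \le k \le M$, $y_i^{(k)}$ affects $\Delta_k$ through the following terms of $\mathrm{MMD}_u^2[Y^{(k)}, \pi]$:

\[ \frac{2}{n(n-1)} \sum_{j \neq i} k\big(y_i^{(k)}, y_j^{(k)}\big) - \frac{2}{n} E_x\big[k\big(y_i^{(k)}, x\big)\big]. \]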
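We define $Y_{-i}^{(k)}$ as $Y^{(k)}$ with the $i$-th component removed. Hence, for $1 \le k \le M$ and $1 \le i \le n$,
replacing $y_i^{(k)}$ by any other value while keeping $Y_{-i}^{(k)}$ fixed changes $\Delta_k$ by at most $\frac{3K}{n}$ in absolute value,
i.e., $\Delta_k$ satisfies the bounded difference condition in (14), with $c_i = \frac{3K}{n}$. Hence, by McDiarmid's
inequality applied to the $2n$ variables on which $\Delta_k$ depends,

\[ P_1\big( \mathrm{MMD}_u^2[Y^{(k)}, \pi] > \mathrm{MMD}_u^2[Y^{(1)}, \pi] \big) = P_1(\Delta_k > 0) = P_1\big( \Delta_k + \mathrm{MMD}^2[\mu, \pi] > \mathrm{MMD}^2[\mu, \pi] \big) \le \exp\Big( -\frac{n \, \mathrm{MMD}^4[\mu, \pi]}{9K^2} \Big). \]

And we prove that

\[ \alpha(\delta_{\mathrm{MMD}}, \pi, \mu) \ge \frac{\mathrm{MMD}^4[\mu, \pi]}{9K^2}. \tag{15} \]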

